I'm here to learn new skills and improve old ones. The course fits my needs perfectly. I have used R a lot, but I haven't shared my code outside of our research group, so learning good practices is very welcome. I went through Exercise 1 'warm up' pretty fast and everything was pretty familiar. As a note to myself: the syntax and the functions used were in many ways different from what I have used to reach the same goals. It might be good to start using fewer 'intermediate' objects and more 'tidyverse' style. I like to use Atom for writing scripts, so it will be nice to try to do things differently.
My github: https://github.com/jvehvi/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 28 20:21:31 2022"
This week's topic was regression and model validation. Here we present basic R commands for building summary tables and linear models, which can be used to study relationships between variables (associations rather than proven cause-and-effect).
# Set up working dir
setwd("E:/Open science course 2022/IODS-project")
dir.create(paste0(getwd(),"/Data"))
## Warning in dir.create(paste0(getwd(), "/Data")): 'E:\Open science course
## 2022\IODS-project\Data' already exists
setwd(paste0(getwd(),"/Data"))
Read the .txt file, which has been downloaded to the Data folder, into the R environment.
learning_2014<-as.data.frame(read.table("Data/learning2014.txt", header = TRUE, sep = "\t"))
The table should have 183 rows and 60 columns, all containing integer values except the last column, which holds gender as a character variable.
# Remove rows which have 0 in Points
analy_dataset<-subset(learning_2014, as.numeric(Points)!=0)
analy_dataset[analy_dataset==0]<-NA
# Build a new dataframe (DF) for the needed data. Gender, Age and Points are added directly, but Attitude, Deep, Stra and Surf are initialized to NA.
df<-data.frame(Gender=analy_dataset$gender,Age=analy_dataset$Age,Attitude=NA, Deep=NA,Stra=NA, Surf=NA, Points=analy_dataset$Points)
# Add Deep to DF by taking the mean
# Note: D03, D11, D19 and D27 appear twice below; since %in% is used, only 8 unique
# questions enter the mean. The d_ri questions (D07, D14, D22, D30) were likely intended.
deep <- c("D03","D11","D19","D27","D03","D11","D19","D27","D06","D15","D23","D31")
df$Deep<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% deep], na.rm=TRUE)
# Add Stra to DF by taking the mean
stra <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
df$Stra<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% stra], na.rm=TRUE)
# Add Surf to DF by taking the mean
surf <- c("SU02","SU10","SU18","SU26","SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
df$Surf<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% surf], na.rm=TRUE)
# Add Attitude to DF by taking the mean
attitude <- c("Da","Db","Dc","Dd","De","Df","Dg","Dh","Di","Dj")
df$Attitude<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% attitude], na.rm=TRUE)
# Change working dir
setwd("E:/Open science course 2022/IODS-project")
# Write results to .csv format file
write.csv2(df,"Data/learning2014.csv")
# Read the saved .csv file. header = TRUE means the file has a header row; sep = ";" means values are separated by ';'; dec = "," means ',' is used as the decimal separator
learning_2014.2<-as.data.frame(read.table("Data/learning2014.csv", header = TRUE, sep = ";", dec=","))
The data set contains summary results from the survey of the course 'Johdatus yhteiskuntatilastotieteeseen, syksy 2014'. There should be 7 variables (gender, age, attitude, deep, stra, surf and points) and 166 observations.

- Gender: Male = 1, Female = 2
- Age: age (in years) derived from the date of birth
- Attitude: global attitude toward statistics; mean of the original variables (~Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)
- Deep: deep approach; mean of the original variables (~d_sm+d_ri+d_ue)
- Stra: strategic approach; mean of the original variables (~st_os+st_tm)
- Surf: surface approach; mean of the original variables (~su_lp+su_um+su_sb)
- Points: total points from the survey

More information about the variables can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt
# Read file to R environment
learning_2014.2<-as.data.frame(read.table("Data/learning2014.csv", header = TRUE, sep = ";", dec=","))
# Check dimensions of read table
dim(learning_2014.2)
## [1] 166 8
With the summary command we can take a look 'inside' the data: it gives boxplot-style information (minimum, quartiles, median, mean, maximum) for the numerical variables in the dataframe. Note that dim() reported 8 columns instead of 7 because write.csv2 saved the row names as an extra index column. A scatterplot matrix is used to describe the relationships between the variables; it is constructed from the dataframe with the ggpairs function (GGally package). In addition to pairwise relationships, the resulting plot shows each variable's distribution and gives correlation coefficients, with asterisks indicating the level of significance.
# Summary
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
summary(learning_2014.2[,2:ncol(learning_2014.2)])
## Gender Age Attitude Deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.375
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.250
## Mode :character Median :22.00 Median :3.200 Median :3.750
## Mean :25.51 Mean :3.143 Mean :3.645
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.000
## Max. :55.00 Max. :5.000 Max. :4.875
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# Scatterplot
p <- ggpairs(learning_2014.2[-(1:2)], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20)))
print(p)
The most promising relationships seem to be between Attitude and Points, and between Surf and Deep. There also seems to be some kind of relationship between Stra and Surf. Overall, the matrix gives a good overview of the data and a starting point for studying the relationships between the variables further.
Create a regression model with multiple explanatory variables
my_model2 <- lm(Points ~ Attitude + Stra + Surf, data = learning_2014.2)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning_2014.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## Stra 0.8531 0.5416 1.575 0.11716
## Surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The summary of the regression model shows that only Attitude correlates significantly with Points. The printout shows the model residuals and a coefficient table (estimate, standard error, t-value and p-value for each variable in the model). The significance of each coefficient can be read from its p-value (the last column of the Coefficients table); the significance-level thresholds are listed under the table. The summary also gives a p-value for the model as a whole (the F-statistic), which here is significant (p = 3.156e-08) even though the model contains variables that have no clear relationship to Points. In the next step, let's remove the non-significant variables from the model and see what happens.
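The coefficient table shown by summary() can also be accessed programmatically. A minimal sketch using the built-in mtcars data (the course data is not bundled here, so the model below is only illustrative):

```r
# Fit a model on built-in data and extract the coefficient table
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# p-values are in the last column; flag terms significant at the 0.05 level
p_values <- coefs[, "Pr(>|t|)"]
names(p_values)[p_values < 0.05]
```

This is handy when a model has many terms and you want to pick out the significant ones without reading the printout by eye.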
# Only Attitude seems to be significant, so let's refit the model with it alone
my_model3 <- lm(Points ~ Attitude, data = learning_2014.2)
summary(my_model3)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning_2014.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the p-values are significant both for the variable and for the model as a whole. Multiple R-squared is the proportion of the variation in the dependent variable that can be explained by the independent variable(s). In the three-variable model, 20.74 % of the variation in Points was explained by the variables; interestingly, Attitude by itself explains 19.06 % of the variation, showing that the effect of Stra and Surf on Points is minimal.
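As a sanity check on this interpretation, R-squared can be computed by hand from the residual and total sums of squares. A minimal sketch using the built-in mtcars data (not the course data):

```r
# R-squared = 1 - SSE/SST: the share of variance in y explained by the model
fit <- lm(mpg ~ wt, data = mtcars)
sse <- sum(residuals(fit)^2)                   # residual sum of squares
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2_manual <- 1 - sse / sst

# Matches the Multiple R-squared reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```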
plot(my_model3, which=c(1,2,5))
From the 'Residuals vs Leverage' plot we can check which and how many observations are influential. In our case the data looks good: no point falls outside the Cook's distance lines.
The 'Residuals vs Fitted' plot also looks good: the points are spread evenly along both axes, with no clear pattern.
The Q-Q plot also indicates the goodness of our model. If the points depart from the line too early, some other variable may affect the relationship more than Attitude does. Here the Q-Q plot looks quite good, but not perfect.
The study material is the Student Performance Data Set (incl. alcohol consumption) from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance). The joined data set used in the analysis combines the two student alcohol consumption data sets from that page. The variables not used for joining the two data sets have been combined by averaging (including the grade variables):

- 'alc_use' is the average of 'Dalc' and 'Walc'
- 'high_use' is TRUE if 'alc_use' is higher than 2 and FALSE otherwise

More information about the variables can be found at https://archive.ics.uci.edu/ml/datasets/Student+Performance
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Read data table to R - environment
wrangled_student_mat_por <- as.data.frame(read_csv("Data/wrangled_student_mat_por.csv"))
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(wrangled_student_mat_por)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
To study the variables' relationships to alcohol consumption, I have selected four variables that I hypothesize to be related to a person's alcohol use:
Age - I think that persons over 18 use more alcohol.
romantic - I think that single persons drink more often.
studytime - I think that persons who study more per week drink less.
freetime - I think that persons who have more free time might drink more.
library(ggplot2)
library(GGally)
# Summary table
int_variables<-c("age","romantic","studytime","freetime","alc_use","high_use")
summary(wrangled_student_mat_por[int_variables])
## age romantic studytime freetime
## Min. :15.00 Length:370 Min. :1.000 Min. :1.000
## 1st Qu.:16.00 Class :character 1st Qu.:1.000 1st Qu.:3.000
## Median :17.00 Mode :character Median :2.000 Median :3.000
## Mean :16.58 Mean :2.043 Mean :3.224
## 3rd Qu.:17.00 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :22.00 Max. :4.000 Max. :5.000
## alc_use high_use
## Min. :1.000 Mode :logical
## 1st Qu.:1.000 FALSE:259
## Median :1.500 TRUE :111
## Mean :1.889
## 3rd Qu.:2.500
## Max. :5.000
# Unique values for romantic column
unique(wrangled_student_mat_por$romantic)
## [1] "no" "yes"
# Group by romantic before summarise
wrangled_student_mat_por %>% group_by(romantic) %>% summarise(count = n())
## # A tibble: 2 × 2
## romantic count
## <chr> <int>
## 1 no 251
## 2 yes 119
# Scatterplot
p <- ggpairs(wrangled_student_mat_por[int_variables], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20)))
p
The results show that there seems to be some kind of relationship between alcohol use and at least freetime, studytime and age. A positive correlation can be seen between alcohol use and freetime, and negative correlations between alcohol use and studytime and between alcohol use and age. An interesting finding is that young persons (15-18) seem to drink more than those aged 19-21.
Use glm() to fit a logistic regression model with high_use as the target variable and freetime, studytime, age and romantic as the predictors.
m <- glm(high_use ~ freetime + studytime + age + factor(romantic), data = wrangled_student_mat_por, family = "binomial")
# Summary
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + studytime + age + factor(romantic),
## family = "binomial", data = wrangled_student_mat_por)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6628 -0.8579 -0.6654 1.1899 2.3332
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4470 1.7762 -2.504 0.012293 *
## freetime 0.3269 0.1238 2.641 0.008274 **
## studytime -0.5850 0.1572 -3.721 0.000198 ***
## age 0.2246 0.1025 2.190 0.028512 *
## factor(romantic)yes -0.2158 0.2594 -0.832 0.405404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 421.66 on 365 degrees of freedom
## AIC: 431.66
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01171392 0.0003399901 0.3664590
## freetime 1.38664482 1.0914115708 1.7751196
## studytime 0.55710460 0.4053055149 0.7516885
## age 1.25176377 1.0261046822 1.5355541
## factor(romantic)yes 0.80586877 0.4806650258 1.3321754
The results show that the widest confidence interval is for the variable 'romantic'. The interval also contains 1, so alcohol consumption and romantic status most probably have no relationship; rather, the variables are independent of each other. Freetime and age have an OR and a confidence interval > 1, so there seems to be a positive association between them and high_use. The studytime values are < 1, meaning a negative association between it and high_use.
Next, I have built a new model excluding 'romantic', which seems to have no relationship to students' alcohol use. After building the model, I have taken the summary information about the model and calculated the mean prediction error.
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Let's select only freetime, studytime and age for the next model
m <- glm(high_use ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial")
# Summary
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + studytime + age, family = "binomial",
## data = wrangled_student_mat_por)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6143 -0.8690 -0.6895 1.2088 2.2716
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2829 1.7668 -2.424 0.015345 *
## freetime 0.3246 0.1236 2.626 0.008630 **
## studytime -0.5928 0.1575 -3.765 0.000167 ***
## age 0.2119 0.1014 2.089 0.036700 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 422.36 on 366 degrees of freedom
## AIC: 430.36
##
## Number of Fisher Scoring iterations: 4
# Making predictions on the train set.
# Probability
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, probability = predict(m, type = "response"))
# Prediction
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, prediction = probability > 0.5)
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = wrangled_student_mat_por$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 96 15
# Making the same predictions with predict(); note that newdata is still the same data, so this is not a genuine test set and the table matches the one above
test_pred<- ifelse(predict(m, newdata = wrangled_student_mat_por, type = "response") > 0.5, "TRUE", "FALSE")
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred)
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 96 15
# We can study the goodness of our model further by looking at the confusionMatrix of the results
confusionMatrix(table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred))
## Confusion Matrix and Statistics
##
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 96 15
##
## Accuracy : 0.7027
## 95% CI : (0.6533, 0.7488)
## No Information Rate : 0.9216
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1028
##
## Mcnemar's Test P-Value : 1.136e-14
##
## Sensitivity : 0.7185
## Specificity : 0.5172
## Pos Pred Value : 0.9459
## Neg Pred Value : 0.1351
## Prevalence : 0.9216
## Detection Rate : 0.6622
## Detection Prevalence : 0.7000
## Balanced Accuracy : 0.6179
##
## 'Positive' Class : FALSE
##
# Prediction plot
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'wrangled_student_mat_por'
# define the geom as points and draw the plot (the original code never printed the plot object)
g <- ggplot(wrangled_student_mat_por, aes(x = high_use, y = as.numeric(probability))) +
  geom_point(alpha = 0.2)
g
# ROC curve
library('pROC')
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
test_prob <- predict(m, newdata = wrangled_student_mat_por, type = "response")
test_roc <- roc(wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
test_roc
##
## Call:
## roc.formula(formula = wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
##
## Data: test_prob in 259 controls (wrangled_student_mat_por$high_use FALSE) < 111 cases (wrangled_student_mat_por$high_use TRUE).
## Area under the curve: 0.6808
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
mean(actual != predicted)
}
# call calc_class_err to compute the share of wrong predictions in the (training) data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = wrangled_student_mat_por$prediction)
## [1] 0.2972973
# call calc_class_err to compute the share of wrong predictions in the 'test' data
# Training and test error rates should be close to each other (here they are identical because the same data was used)
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = test_pred)
## [1] 0.2972973
Our results show that approximately 30 % of our predictions are wrong, so the accuracy of our model is 0.7, which is at least better than pure guessing. We selected 0.5 as the classification threshold; that value is our own choice, and its effect on accuracy can be studied from the ROC curve. In this case it is okay. The test error rate is the same as the training error rate, so the model seems to be fitted properly (though note that here the 'test' predictions were made on the same data). When the predictions are compared to high_use, we can see that they contain too many low-use drinkers (alc_use < 2) compared to the real situation, and correspondingly too few high users.
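The quantities reported by confusionMatrix can be checked by hand from the 2x2 table. A minimal sketch using the counts from the cross tabulation above:

```r
# Counts from the cross tabulation above (rows = actual high_use, cols = prediction)
tn <- 245  # actual FALSE, predicted FALSE
fp <- 14   # actual FALSE, predicted TRUE
fn <- 96   # actual TRUE,  predicted FALSE
tp <- 15   # actual TRUE,  predicted TRUE
n  <- tn + fp + fn + tp

accuracy <- (tn + tp) / n  # share of correct predictions
error    <- 1 - accuracy   # mean prediction error, matching calc_class_err above
round(c(accuracy = accuracy, error = error), 4)  # 0.7027 and 0.2973
```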
Let's perform 10-fold cross-validation on our model to see if we get better test-set performance. 10-fold means that all observations are split into 10 bins; each bin in turn serves as the test set while the model is trained on the other nine.
library(caret)
#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)
#fit a regression model and use k-fold CV to evaluate performance
model <- train(factor(high_use) ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial", method = "glm", trControl = ctrl)
model
## Generalized Linear Model
##
## 370 samples
## 3 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 332, 333, 333, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 0.697321 0.07746201
# Show also summary of model
summary(model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6143 -0.8690 -0.6895 1.2088 2.2716
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2829 1.7668 -2.424 0.015345 *
## freetime 0.3246 0.1236 2.626 0.008630 **
## studytime -0.5928 0.1575 -3.765 0.000167 ***
## age 0.2119 0.1014 2.089 0.036700 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 422.36 on 366 degrees of freedom
## AIC: 430.36
##
## Number of Fisher Scoring iterations: 4
#View final model parameters
model$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) freetime studytime age
## -4.2829 0.3246 -0.5928 0.2119
##
## Degrees of Freedom: 369 Total (i.e. Null); 366 Residual
## Null Deviance: 452
## Residual Deviance: 422.4 AIC: 430.4
The results show that our accuracy is approximately 0.70, meaning that about 30 % of our predictions are wrong. The results are almost the same as with the plain logistic regression model, so in this case the 10-fold cross-validation does not change the picture. Typically, as k gets larger, the difference in size between the training set and the resampling subsets gets smaller; as this difference decreases, the bias of the technique becomes smaller, but there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. k = 10 or k = 5 are typical choices because they have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

According to the summary table of the model, studytime has the strongest negative effect on alcohol use, freetime has the strongest positive association with high_use, and age also has a significant effect on drinking habits. These results seem to be in line with the earlier ones.
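The k-fold procedure that caret runs under the hood can also be sketched by hand. A minimal sketch on simulated data (the variables x and y below are made up for illustration, not taken from the course data):

```r
set.seed(1)
# Simulated data with a binary outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x))
d <- data.frame(x = x, y = y)

# Assign each observation to one of k folds
k <- 10
fold <- sample(rep(1:k, length.out = n))

# Train on k-1 folds, predict on the held-out fold, collect error rates
errs <- sapply(1:k, function(i) {
  fit  <- glm(y ~ x, data = d[fold != i, ], family = "binomial")
  prob <- predict(fit, newdata = d[fold == i, ], type = "response")
  mean((prob > 0.5) != d$y[fold == i])
})
mean(errs)  # cross-validated mean prediction error
```

Averaging over the folds gives an error estimate in which every observation has been predicted exactly once by a model that never saw it during training.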
For testing and comparing the performance of different models, I have selected 10 variables as a starting point and then dropped variables one at a time until three are left. The selected variables are:

- famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
- Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- sex - student's sex (binary: 'F' - female or 'M' - male)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- age - student's age (numeric: from 15 to 22)
The data has been divided into test and training sets so that the model's performance on unseen data can be better understood.
# select only columns under interest
wrangled_student_mat_por_sub <- wrangled_student_mat_por[,colnames(wrangled_student_mat_por) %in% c("famsize","Medu","Fedu","traveltime","famrel","sex","health","freetime","studytime","age","high_use")]
# Divide data into test and training sets by using 70 % of the data for the training set
# (note: no set.seed() is used, so the random split changes between runs)
dat <- list()
dat[['index']] <- sample(nrow(wrangled_student_mat_por_sub), size = nrow(wrangled_student_mat_por_sub) * 0.7)
dat[['training']] <- wrangled_student_mat_por_sub[dat$index, ]
dat[['test']] <- wrangled_student_mat_por_sub[-dat$index,]
# Empty lists
count_of_variables <- test_mses <- training_mses <- list()
# For loop to test different counts of variables
for (i in 1:(ncol(wrangled_student_mat_por_sub) - 1)) {
  # Build a new model on each iteration, containing one variable less
  # (family = "binomial" added so this is a logistic regression, matching the earlier models)
  train_mod <- glm(high_use ~ ., data = dat$training[, i:ncol(dat$training)], family = "binomial")
  # Predict probabilities for the training data
  y_hat_training <- predict(train_mod, type = "response")
  train_predict <- y_hat_training > 0.5
  # Collect the training error
  training_mses[[i]] <- calc_class_err(actual = dat$training$high_use, predicted = train_predict)
  # Predict probabilities for the test data (type = "response" keeps the 0.5 threshold valid)
  y_hat_test <- predict(train_mod, newdata = dat$test, type = "response")
  test_predict <- y_hat_test > 0.5
  test_mses[[i]] <- calc_class_err(actual = dat$test$high_use, predicted = test_predict)
  count_of_variables[[i]] <- length(train_mod$coefficients) - 1
}
# Build dataframe from results to be used in ggplot
df<-data.frame(Number_of_Variables=unlist(count_of_variables), Test_error=unlist(test_mses),
Training_error=unlist(training_mses))
# Build plot
library("reshape2")
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
df_long <- melt(df, id = "Number_of_Variables")
df_long$Number_of_Variables<-factor(df_long$Number_of_Variables, levels=df$Number_of_Variables)
a<-ggplot(df_long, aes(x = Number_of_Variables,y=value, color=variable, group=variable)) +
geom_point() +
geom_line() +
geom_smooth() +
scale_x_discrete(limits = rev(levels(df$Number_of_Variables)))
a
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In the plot, test and training errors are shown as separate lines against the number of predictor variables in the model. I selected ten variables that might have an effect on alcohol consumption. Typically, a training error that is smaller than the test error indicates that the model overfits the training data; the model can be considered good enough when the test and training errors are close to each other. From the plot we can see that the training error first rises in the range of 10-8 variables, while in the same range the test error is lower and remains almost unchanged. Overfitting can be seen in the ranges 7-5 and 3-1. With four variables in the model, the test and training errors are close to each other, indicating a good fit. These results change depending on how the data is divided into test and training sets. According to these results, a logistic regression model with four variables seems good if the variables are well chosen; different groupings of the selected variables should also be tried to find out which four variables would be the best.
This assignment's topic is clustering and classification. Clustering means that some points/observations in the data lie so near each other, compared to the other points/observations, that they form their own cluster. There are multiple clustering methods for finding such clusters. One of the most common is k-means clustering; others worth naming are the hierarchical clustering methods, which give dendrograms as output. If the clustering succeeds, new observations can then be classified into those clusters.
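As a quick illustration of the idea, k-means can be run with base R's kmeans function. A minimal sketch on simulated 2-D data (not the Boston data used below):

```r
set.seed(42)
# Two well-separated 2-D point clouds
a <- matrix(rnorm(100, mean = 0), ncol = 2)
b <- matrix(rnorm(100, mean = 4), ncol = 2)
pts <- rbind(a, b)

# k-means with k = 2 clusters
km <- kmeans(pts, centers = 2)
table(km$cluster)  # cluster sizes
km$centers         # cluster centroids

# A new observation can be classified to the nearest centroid
new_pt <- c(4, 4)
dists <- apply(km$centers, 1, function(ctr) sum((new_pt - ctr)^2))
which.min(dists)   # index of the cluster the new point falls into
```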
To clarify this topic, Boston dataset from MASS - package will be used as a demonstration dataset.
The Boston Dataset
The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The Boston data frame has 506 rows and 14 columns.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## corrplot 0.92 loaded
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
## [1] 506 14
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
There are 506 observations of 14 variables summarizing housing in the Boston, Massachusetts area. Using this dataset, we can look for relationships between the variables. The relationship of the per capita crime rate (crim) to the other variables is especially interesting: if we could identify the variables with the strongest effect on it, we could try to influence those variables when planning new residential areas and thereby curb the development of the crime rate. From the summary table we can see that not all variables are easy or sensible to use as predictors in the same model in their current form. For example, the observations of “proportion of residential land zoned for lots over 25,000 sq.ft.” (zn) are very unevenly distributed, whereas those of “pupil-teacher ratio by town” (ptratio) are much more evenly distributed. Interestingly, when the other variables are plotted against the crime rate, possible correlations show up even where the distributions are skewed. The correlation matrix shows some positive or negative correlation between the per capita crime rate and every other variable: six variables correlate negatively and seven positively. The weakest negative correlation is with the “Charles River dummy variable (1 if tract bounds river; 0 otherwise)” (chas), and the negative correlations are weak overall. The positive correlations are clearly stronger, with “index of accessibility to radial highways” (rad) and “full-value property-tax rate per $10,000” (tax) correlating most strongly with the per capita crime rate. The plot visualizes the same results and makes these conclusions easier to draw.
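As a quick check of the claim above, the correlations with crim can be sorted directly. A minimal base-R sketch (variable names follow the MASS::Boston dataset):

```r
# Rank the Boston variables by their correlation with per capita
# crime rate (crim); rad and tax should come out on top.
library(MASS)  # provides the Boston dataset
data(Boston)

cor_with_crim <- cor(Boston)["crim", ]
# Drop the self-correlation and sort from most negative to most positive
sort(cor_with_crim[names(cor_with_crim) != "crim"])
```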
To get a clearer understanding of the data and make it usable in a prediction model, it has to be standardized. To do that, we scale and center the columns of the numeric data: centering subtracts each column mean (omitting NAs) from its column, and scaling divides the (centered) columns by their standard deviations.
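Concretely, scale() is equivalent to doing the centering and scaling by hand; a tiny sketch with made-up numbers:

```r
# scale() centers by the column mean and divides by the column
# standard deviation; the manual computation gives the same result.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto   <- as.vector(scale(x))
all.equal(manual, auto)  # TRUE
```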
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
## [1] "data.frame"
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
After scaling, the values of the different variables are on much more comparable ranges. The correlation matrix shows the same results as with the raw values, since correlation is unaffected by centering and scaling.
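The matching correlation matrices are no accident: correlation is invariant to linear rescaling, as a quick check shows (made-up data, not the Boston set):

```r
# Correlation is invariant to linear rescaling, so the correlation
# matrix of the standardized data equals that of the raw data.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
all.equal(cor(x, y), cor(scale(x)[, 1], scale(y)[, 1]))  # TRUE
```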
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
# Bring Boston data table from github to R
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)
# Change 'crime' to factor variable with relevant levels
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
# Build up test and train datasets
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
dim(train)
## [1] 404 14
# create test set
test <- boston_scaled[-ind,]
dim(test)
## [1] 102 14
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
head(test)
## zn indus chas nox rm age
## 7 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3880270 -0.07015919
## 16 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.6413655 -0.42896588
## 17 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.4976172 -1.39525719
## 23 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044 0.82152875
## 26 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.9758293 0.60837625
## 27 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.6712537 0.77179316
## dis rad tax ptratio black lstat
## 7 0.83841422 -0.5224844 -0.5769480 -1.503749 0.4263763 -0.03123671
## 16 0.33411879 -0.6373311 -0.6006817 1.175303 0.4265954 -0.58577611
## 17 0.33411879 -0.6373311 -0.6006817 1.175303 0.3305330 -0.85044265
## 23 0.08636389 -0.6373311 -0.6006817 1.175303 0.4406159 0.84958472
## 26 0.31322322 -0.6373311 -0.6006817 1.175303 -0.5833190 0.54010692
## 27 0.42121530 -0.6373311 -0.6006817 1.175303 0.2213265 0.30204708
## medv
## 7 0.03992492
## 16 -0.28626471
## 17 0.06167090
## 23 -0.79729513
## 26 -0.93864397
## 27 -0.64507330
Now we have built up training and test data by randomly selecting 80 % of the rows as the training set and using the remaining 20 % as the test set. Both sets will be used in the next part.
Now we use the training set for linear discriminant analysis. The function tries hard to detect if the within-class covariance matrix is singular: if any variable has within-group variance less than tol^2, it stops and reports the variable as constant. The function returns:
* prior - the prior probabilities used.
* means - the group means.
* scaling - a matrix which transforms observations to discriminant functions, normalized so that the within-groups covariance matrix is spherical.
* svd - the singular values, which give the ratio of the between- and within-group standard deviations on the linear discriminant variables. Their squares are the canonical F-statistics.
* N - the number of observations used.
* call - the (matched) function call.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2524752 0.2301980 0.2673267
##
## Group means:
## zn indus chas nox rm age
## low 0.95720988 -0.9974152 -0.15538550 -0.9236781 0.4259313 -0.9151365
## med_low -0.09293868 -0.3511530 -0.04073494 -0.5975619 -0.1266931 -0.3659605
## med_high -0.39410920 0.2733794 0.19334945 0.3750116 0.1650971 0.4654724
## high -0.48724019 1.0169921 -0.05360128 1.0579430 -0.4086114 0.7968254
## dis rad tax ptratio black lstat
## low 0.9225903 -0.6885004 -0.7251658 -0.39837690 0.3884771 -0.77813734
## med_low 0.4378610 -0.5438774 -0.5293644 -0.05372787 0.3233714 -0.14698541
## med_high -0.4035150 -0.4051679 -0.2705798 -0.24915387 0.1170678 -0.07059714
## high -0.8470258 1.6393984 1.5149640 0.78225547 -0.8112050 0.88739423
## medv
## low 0.49831286
## med_low -0.01849074
## med_high 0.16899781
## high -0.71896935
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07641944 0.593281114 -1.02217403
## indus 0.10765110 -0.696940487 0.34908894
## chas -0.08896789 -0.089544727 0.14830936
## nox 0.45028218 -0.589517652 -1.15508780
## rm -0.08567328 -0.069019878 -0.14263845
## age 0.22932653 -0.398133937 -0.23362567
## dis -0.07062677 -0.267905582 0.40947588
## rad 3.07547503 0.796406811 0.32208758
## tax 0.00738399 0.331248826 0.01112788
## ptratio 0.10860697 0.052156452 -0.20610157
## black -0.13334457 0.006753021 0.12613448
## lstat 0.22046003 -0.014002027 0.55635912
## medv 0.18489459 -0.200969317 -0.15167768
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9448 0.0420 0.0132
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
From the results plot we can see that rad is the most influential variable for LD1 in our training set. The result object confirms this: rad has the largest positive LD1 coefficient, about 3.08, while the second most influential variable, nox, has a coefficient of only about 0.45. For LD2, zn has the highest positive coefficient, about 0.59. Note that the exact coefficients vary between runs because the train/test split is random. Next, we predict the classes with the LDA model on the test data.
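The class and posterior listings below come from predict() on the held-out test rows; the chunk that produced them is not shown. A self-contained sketch of the whole pipeline (the seed and the quantile-based crime classes here are assumptions, so exact numbers will differ from the printed output):

```r
library(MASS)

# Standardize Boston and derive a categorical crime variable from
# the quantiles of the scaled crime rate (assumed recoding).
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- boston_scaled[, colnames(boston_scaled) != "crim"]

# 80/20 train/test split (seed fixed here for reproducibility)
set.seed(42)
n   <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

# Fit LDA on the training set and predict the held-out classes
lda.fit  <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)
length(lda.pred$class)    # one predicted class per test row
ncol(lda.pred$posterior)  # one posterior column per crime class
```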
## $class
## [1] med_low med_low med_low med_low med_low med_low med_low low
## [9] med_low med_low med_low med_low med_high med_low med_high med_high
## [17] med_low med_high med_high med_high med_high med_high med_high med_high
## [25] med_high med_high med_high med_high med_high med_low med_low med_low
## [33] low low low low low low low low
## [41] med_low med_low med_low med_low med_low med_low med_low med_high
## [49] med_high med_high med_low med_high med_low low med_low low
## [57] med_high med_high med_high med_high med_low low med_low low
## [65] med_low med_low med_high med_low med_low med_low med_low low
## [73] med_low med_low med_low low low high high high
## [81] high high high high high high high high
## [89] high high high high high high high high
## [97] high med_high med_high med_high med_high med_low
## Levels: low med_low med_high high
##
## $posterior
## low med_low med_high high
## 7 1.116775e-01 7.133338e-01 1.749887e-01 2.009488e-15
## 16 2.042848e-01 6.035914e-01 1.921238e-01 2.978725e-16
## 17 3.252279e-01 5.753440e-01 9.942816e-02 5.257400e-17
## 23 5.703718e-02 6.031237e-01 3.398391e-01 1.126166e-14
## 26 6.346462e-02 6.082181e-01 3.283173e-01 1.716595e-14
## 27 6.152680e-02 5.661387e-01 3.723345e-01 7.040839e-15
## 35 4.132069e-02 4.980798e-01 4.605995e-01 1.109314e-13
## 40 9.952841e-01 4.518512e-03 1.974271e-04 4.085546e-21
## 54 4.544407e-01 5.402797e-01 5.279609e-03 1.237658e-19
## 76 9.344795e-02 8.756093e-01 3.094275e-02 4.314846e-17
## 106 7.010030e-02 4.999078e-01 4.299919e-01 5.594781e-13
## 107 6.366763e-02 5.685396e-01 3.677928e-01 6.723147e-13
## 109 9.678023e-02 4.167919e-01 4.864279e-01 1.046626e-13
## 110 8.265270e-02 5.203908e-01 3.969565e-01 1.965004e-13
## 112 6.771302e-02 3.816869e-01 5.506001e-01 3.365796e-13
## 116 3.958863e-02 4.601686e-01 5.002428e-01 3.982344e-12
## 120 7.863042e-02 6.321376e-01 2.892320e-01 4.879579e-13
## 121 3.461152e-03 2.074338e-01 7.891051e-01 5.221521e-17
## 122 2.170660e-03 1.412482e-01 8.565811e-01 7.444952e-17
## 123 1.137547e-03 1.229520e-01 8.759105e-01 2.801245e-16
## 139 1.708092e-03 9.037120e-02 9.079207e-01 9.576204e-13
## 142 5.543856e-04 1.534891e-01 8.459565e-01 6.919590e-11
## 143 1.342218e-05 2.503381e-03 9.974832e-01 3.539774e-11
## 146 6.576913e-06 5.819189e-04 9.994115e-01 3.092960e-09
## 147 1.152300e-05 3.441798e-04 9.996443e-01 4.153527e-10
## 148 1.251540e-05 1.789921e-03 9.981976e-01 1.909610e-09
## 153 4.364065e-05 1.622610e-03 9.983337e-01 1.803712e-12
## 157 1.455603e-05 4.124797e-04 9.995730e-01 6.925407e-10
## 158 8.305436e-04 1.667995e-02 9.824895e-01 1.108524e-13
## 178 2.801299e-01 5.482736e-01 1.715965e-01 8.088377e-16
## 179 2.600974e-01 4.848259e-01 2.550767e-01 1.680102e-15
## 185 1.991209e-01 6.695253e-01 1.313537e-01 8.343942e-17
## 189 9.442562e-01 5.400591e-02 1.737848e-03 2.294184e-18
## 191 8.993926e-01 9.715025e-02 3.457203e-03 2.940146e-18
## 193 9.196202e-01 7.693561e-02 3.444231e-03 1.285632e-18
## 197 9.946295e-01 5.298716e-03 7.178030e-05 6.441128e-24
## 198 9.910682e-01 8.834800e-03 9.698475e-05 3.904672e-23
## 199 9.922042e-01 7.676755e-03 1.189993e-04 2.800242e-23
## 204 9.971909e-01 2.184586e-03 6.244859e-04 2.291078e-19
## 205 9.974525e-01 1.917999e-03 6.295334e-04 1.724477e-19
## 207 1.017639e-01 7.970500e-01 1.011862e-01 4.073316e-17
## 210 7.861840e-03 8.728449e-01 1.192932e-01 6.098211e-16
## 212 9.425262e-03 9.112599e-01 7.931483e-02 3.274059e-16
## 213 3.743648e-02 9.174359e-01 4.512763e-02 7.551064e-18
## 215 2.693103e-02 9.611961e-01 1.187288e-02 5.711630e-16
## 216 1.452504e-01 7.766207e-01 7.812888e-02 2.257245e-17
## 217 1.929481e-02 7.442701e-01 2.364351e-01 4.560535e-16
## 218 1.270115e-02 2.391894e-01 7.481094e-01 1.742492e-14
## 225 3.210085e-02 1.323957e-01 8.355035e-01 1.262223e-11
## 226 1.696488e-02 7.936452e-02 9.036706e-01 1.940899e-11
## 230 2.462753e-01 6.039777e-01 1.497470e-01 4.286857e-13
## 233 4.375921e-02 1.720882e-01 7.841526e-01 3.290605e-12
## 236 6.963796e-02 6.504862e-01 2.798759e-01 1.345188e-11
## 252 5.060448e-01 4.841532e-01 9.801980e-03 2.732455e-16
## 254 3.817954e-01 5.742306e-01 4.397404e-02 5.112891e-16
## 257 9.960235e-01 3.612501e-03 3.639795e-04 4.737757e-21
## 259 4.093459e-02 2.190876e-02 9.371567e-01 1.868553e-13
## 262 3.654754e-02 1.952215e-02 9.439303e-01 1.722002e-13
## 264 6.195266e-02 4.363880e-02 8.944085e-01 1.754541e-13
## 266 3.302785e-01 1.717248e-01 4.979967e-01 8.387289e-14
## 270 2.827773e-01 6.922895e-01 2.493319e-02 2.080806e-19
## 280 6.678464e-01 3.155085e-01 1.664508e-02 6.195425e-18
## 283 3.814058e-01 5.454876e-01 7.310664e-02 1.916194e-18
## 292 9.941298e-01 5.178059e-03 6.921040e-04 3.044488e-19
## 294 6.570856e-02 9.244183e-01 9.873107e-03 8.605026e-20
## 297 4.372992e-02 9.139900e-01 4.228006e-02 6.444885e-19
## 313 7.439606e-02 4.617269e-01 4.638770e-01 2.903775e-15
## 321 1.996487e-01 6.829992e-01 1.173521e-01 3.522681e-16
## 324 9.137122e-02 7.658952e-01 1.427335e-01 3.447146e-15
## 325 2.394066e-01 6.762450e-01 8.434838e-02 1.380862e-16
## 330 4.836940e-01 5.131962e-01 3.109807e-03 7.444364e-20
## 333 8.615764e-01 1.377343e-01 6.892534e-04 1.079480e-23
## 334 2.247274e-01 6.806355e-01 9.463703e-02 9.427207e-17
## 338 1.355479e-01 7.307107e-01 1.337414e-01 1.271498e-15
## 340 2.188496e-01 6.913376e-01 8.981275e-02 5.386374e-16
## 348 9.903906e-01 9.335274e-03 2.740760e-04 5.066698e-20
## 352 9.379721e-01 6.131054e-02 7.173858e-04 2.934268e-20
## 367 2.476981e-20 3.620420e-17 4.238775e-13 1.000000e+00
## 369 6.998706e-20 6.701983e-17 1.529547e-12 1.000000e+00
## 371 1.782481e-17 1.945155e-14 3.153721e-10 1.000000e+00
## 386 9.030824e-21 6.238381e-17 9.334043e-14 1.000000e+00
## 391 2.736276e-19 4.947836e-16 2.329180e-12 1.000000e+00
## 396 9.855294e-19 1.402757e-15 6.291107e-12 1.000000e+00
## 399 2.002490e-20 1.310909e-16 1.618393e-13 1.000000e+00
## 415 1.194680e-23 2.096343e-19 5.037787e-16 1.000000e+00
## 416 9.322605e-22 4.669631e-18 2.080823e-14 1.000000e+00
## 428 1.759625e-19 1.483625e-16 7.509318e-13 1.000000e+00
## 434 2.990078e-20 3.805701e-17 4.898208e-13 1.000000e+00
## 440 1.993891e-20 5.728044e-17 3.693258e-13 1.000000e+00
## 444 6.164393e-20 1.108200e-16 1.513367e-12 1.000000e+00
## 450 8.332681e-20 1.664108e-16 1.275593e-12 1.000000e+00
## 461 1.638814e-19 2.421737e-16 2.359750e-12 1.000000e+00
## 474 7.254837e-17 7.385543e-14 8.711083e-11 1.000000e+00
## 475 7.999821e-18 2.490216e-14 8.542058e-12 1.000000e+00
## 476 1.332341e-18 6.493563e-15 2.450574e-12 1.000000e+00
## 477 8.764637e-18 2.156858e-14 1.670989e-11 1.000000e+00
## 485 2.000381e-15 2.907725e-12 2.628919e-10 1.000000e+00
## 491 9.308589e-04 2.597028e-01 7.393663e-01 1.806241e-11
## 493 2.891066e-03 1.236607e-01 8.734482e-01 1.296464e-13
## 497 2.341412e-02 4.505605e-01 5.260254e-01 2.829887e-11
## 502 2.199584e-01 3.686441e-01 4.113975e-01 7.633848e-19
## 506 2.788546e-01 4.017031e-01 3.194423e-01 3.300769e-19
##
## $x
## LD1 LD2 LD3
## 7 -2.1153438 -0.367901917 0.60398591
## 16 -2.3676396 -0.375366624 0.09105350
## 17 -2.5827588 -0.067441975 0.08415944
## 23 -1.8808423 -0.754413526 0.51278552
## 26 -1.8373617 -0.651161446 0.48581655
## 27 -1.9391051 -0.816408300 0.37293279
## 35 -1.5927214 -0.772841977 0.38570442
## 40 -3.4366062 2.425925315 -2.34128404
## 54 -3.2526389 0.798851930 1.07525816
## 76 -2.5308783 -0.039954423 1.59767126
## 106 -1.4325477 -0.359090649 0.16896688
## 107 -1.4080716 -0.307145288 0.40495443
## 109 -1.6415585 -0.465655580 -0.22795932
## 110 -1.5644384 -0.370402164 0.14916532
## 112 -1.4832999 -0.536153224 -0.17413636
## 116 -1.1699358 -0.441314009 0.32965126
## 120 -1.4575458 -0.148514823 0.49848714
## 121 -2.3206070 -2.798614010 0.51592639
## 122 -2.2381038 -2.971282858 0.36186643
## 123 -2.0410621 -3.088981886 0.55850315
## 139 -1.1008719 -2.062661253 0.12695970
## 142 -0.5572067 -2.025984236 1.24598637
## 143 -0.2555841 -3.518403993 -0.77195107
## 146 0.3682290 -3.276907739 -1.72420183
## 147 0.1234483 -3.257738793 -2.50674035
## 148 0.2285219 -3.107811247 -1.01209187
## 153 -0.6526576 -3.363647041 -1.78754905
## 157 0.1623817 -3.116205975 -2.45184640
## 158 -1.2430546 -2.566351066 -1.12562615
## 178 -2.2627654 -0.088015205 -0.09778702
## 179 -2.1752600 -0.221892497 -0.33924517
## 185 -2.5122614 -0.345957155 0.35280215
## 189 -2.8367452 1.989588293 -0.90989546
## 191 -2.8420392 1.656109360 -0.63878566
## 193 -2.9302303 1.584350877 -0.87085264
## 197 -4.1788730 2.204291467 -1.81295547
## 198 -3.9948107 2.240135078 -1.45359844
## 199 -4.0316561 2.112429413 -1.67521688
## 204 -2.9569406 2.337583807 -3.47573242
## 205 -2.9849004 2.307072267 -3.60196275
## 207 -2.5603318 -0.568520988 0.95555825
## 210 -2.1050428 -1.362848827 2.27090508
## 212 -2.1823619 -1.167356634 2.39109454
## 213 -2.6914306 -0.765629062 1.91924868
## 215 -2.1443862 0.198308420 2.74191428
## 216 -2.6437572 -0.369683832 0.86051377
## 217 -2.1960588 -1.359602699 1.37706015
## 218 -1.7203538 -1.642532403 0.07195037
## 225 -0.9807610 -0.605710244 -0.92727213
## 226 -0.8743177 -0.832088700 -1.11125812
## 230 -1.5236774 0.597308107 0.16964324
## 233 -1.1652914 -0.606253637 -0.82431850
## 236 -1.0635143 0.175095147 0.62879946
## 252 -2.3654996 1.380397533 0.72015484
## 254 -2.3109051 0.624713701 0.37490516
## 257 -3.4212706 2.160179094 -2.81140718
## 259 -1.4143553 -0.965874758 -2.79573160
## 262 -1.4127853 -1.019245271 -2.84940277
## 264 -1.4731059 -0.807707613 -2.34702261
## 266 -1.7008663 0.006647663 -1.67165694
## 270 -3.2037471 -0.070375069 0.87445463
## 280 -2.8158487 0.844809765 -0.07576998
## 283 -2.9711072 -0.214497123 0.05935791
## 292 -2.9613163 2.293774441 -2.71953681
## 294 -3.2188923 -0.309737109 2.26312467
## 297 -2.9866017 -0.938404478 1.84480769
## 313 -2.0490142 -0.934878448 -0.01290974
## 321 -2.3428064 -0.137499837 0.43120716
## 324 -2.0400476 -0.294640678 0.86266098
## 325 -2.4561360 -0.010958150 0.46590194
## 330 -3.3037467 1.019662802 1.22101507
## 333 -4.2874217 1.043863374 0.29258386
## 334 -2.4995375 -0.131273598 0.45038028
## 338 -2.1759322 -0.215048042 0.64166414
## 340 -2.2939822 0.069929531 0.51589149
## 348 -3.1782122 2.519290545 -1.78881394
## 352 -3.3349414 1.931745633 -0.44506710
## 367 6.3376844 0.093591647 -0.71571070
## 369 6.2293590 -0.121282243 -1.22089442
## 371 5.5801966 -0.621089607 -1.05742940
## 386 6.4002977 0.395294305 0.94475761
## 391 6.0608661 0.158412330 -0.23838124
## 396 5.9263183 0.163154291 -0.34609478
## 399 6.3139659 0.427169306 0.99452583
## 415 7.1104607 0.418406591 1.25537523
## 416 6.6649642 0.285724345 0.33505733
## 428 6.1569011 0.553207429 -0.64027535
## 434 6.3222346 0.097894600 -0.82618685
## 440 6.3334831 0.060243006 -0.12429065
## 444 6.2158615 -0.180491230 -0.68866106
## 450 6.1849368 0.006745378 -0.39002302
## 461 6.1195238 -0.028491153 -0.64747970
## 474 5.4684501 0.498158194 0.02927057
## 475 5.6818552 0.758534645 1.13100383
## 476 5.8627466 0.682073930 1.32542854
## 477 5.6703027 0.482178797 0.66263098
## 485 5.1065701 1.174210344 1.28550061
## 491 -0.7631106 -1.918544893 1.51905993
## 493 -1.3769112 -2.061830919 0.15283869
## 497 -0.9108578 -0.459711910 0.56793353
## 502 -3.0628150 -1.332689893 -0.78241737
## 506 -3.1732525 -1.212914752 -0.71960480
## [1] 0.4215686
## predicted
## correct low med_low med_high high
## low 11 11 4 0
## med_low 5 12 7 0
## med_high 0 15 17 1
## high 0 0 0 19
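The error rate quoted next is one minus the share of observations on the diagonal of a confusion table like the one above; a minimal sketch with made-up labels:

```r
# Misclassification (error) rate from a confusion table:
# 1 minus the proportion of observations on the diagonal.
predicted <- factor(c("a", "a", "b", "b", "b"))
correct   <- factor(c("a", "b", "b", "b", "a"))
tab <- table(correct, predicted)
error_rate <- 1 - sum(diag(tab)) / sum(tab)
error_rate  # 0.4
```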
The results show that the high and low groups are assigned to the right classes much better than the med_high and med_low groups. The overall error rate in this run is about 42 % (the exact value depends on how observations are divided into training and test data). This is quite high and could be better.
To improve the error rate, we can standardise the dataset before computing distances, scaling all variables to have mean = 0 and standard deviation = 1.
# Bring data to R
library(MASS)
library(dplyr)
data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~(scale(.) %>% as.vector)))
# Calculate distances
## Euclidean distance matrix
dist_eu <- dist(standardised_boston, method = "euclidean")
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(standardised_boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
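The gap between the two summaries follows from the definitions: for the same pair of points, the Manhattan distance is never shorter than the Euclidean one. A two-point sketch:

```r
# Euclidean: straight-line distance; Manhattan: sum of absolute
# coordinate differences. For (0,0) and (3,4) these are 5 and 7.
p <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(p, method = "euclidean"))  # 5
as.numeric(dist(p, method = "manhattan"))  # 7
```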
set.seed(13)
# K-means clustering, center 4
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.269 2.000 4.000
# K-means clustering, center 3
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.929 2.000 3.000
# K-means clustering, center 2
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.729 2.000 2.000
# K-means clustering, centers 5
km <- kmeans(Boston, centers = 5)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 3.000 2.666 3.000 5.000
# K-means clustering, centers 6
km <- kmeans(Boston, centers = 6)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.312 6.000 6.000
From the summaries we can see that the two distance measures differ considerably, the Manhattan method giving longer distances. The Euclidean distance is the straight-line distance between two points, whereas the Manhattan distance is the sum of the absolute differences of their coordinates. Two clusters seem to work best. With centers set to 4, the cluster labels range from 1 to 4 and cluster 2 is the most common; with higher values of centers, four clusters again seem to be the most common outcome. So the cluster summary alone is a poor way to select a good number of clusters. The plots, however, suggest that four clusters could be present in the data, although at least two and three clusters should also be tried.
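Since the cluster summary alone is a weak guide, a common complementary heuristic (not used in the original analysis) is to plot the total within-cluster sum of squares against the number of clusters and look for the “elbow” where the drop levels off:

```r
library(MASS)

# Total within-cluster sum of squares (WCSS) for k = 1..10 on the
# standardized Boston data; a sharp bend suggests a good k.
data(Boston)
set.seed(13)
twcss <- sapply(1:10, function(k) kmeans(scale(Boston), centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```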
data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~(scale(.) %>% as.vector)))
# K-means clustering, center 3
km <- kmeans(standardised_boston, centers = 3)
# plot the Boston dataset with clusters
pairs(standardised_boston[1:5], col = km$cluster)
pairs(standardised_boston[6:10], col = km$cluster)
pairs(standardised_boston[11:ncol(standardised_boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.743 2.000 3.000
standardised_boston$N_clusters <- factor(km$cluster)
# Build up test and train datasets
# choose randomly 80% of the rows
ind <- sample(nrow(standardised_boston), size = nrow(standardised_boston) * 0.8)
# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404 15
# create test set
test <- standardised_boston[-ind,]
dim(test)
## [1] 102 15
# save the 'correct' clusters from test data
correct_classes <- test$N_clusters
# remove the N_clusters variable from test data
test <- dplyr::select(test, -N_clusters)
head(test)
## crim zn indus chas nox rm
## 3 -0.4169290 -0.48724019 -0.5927944 -0.2723291 -0.7395304 1.2814456
## 8 -0.4032966 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069
## 10 -0.4003331 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130
## 11 -0.3939564 0.04872402 -0.4761823 -0.2723291 -0.2648919 0.1314594
## 21 -0.2745709 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -1.0171035
## 23 -0.2768170 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044
## age dis rad tax ptratio black lstat
## 3 -0.2655490 0.556609050 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 8 0.9778406 1.023624897 -0.5224844 -0.5769480 -1.5037485 0.4406159 0.9097999
## 10 0.6154813 1.328320207 -0.5224844 -0.5769480 -1.5037485 0.3289995 0.6227277
## 11 0.9138948 1.211779950 -0.5224844 -0.5769480 -1.5037485 0.3926395 1.0918456
## 21 1.0488914 0.001356935 -0.6373311 -0.6006817 1.1753027 0.2179309 1.1716657
## 23 0.8215287 0.086363887 -0.6373311 -0.6006817 1.1753027 0.4406159 0.8495847
## medv
## 3 1.3229375
## 8 0.4965904
## 10 -0.3949946
## 11 -0.8190411
## 21 -0.9712629
## 23 -0.7972951
# linear discriminant analysis
lda.fit <- lda(N_clusters ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(N_clusters ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4702970 0.3267327 0.2029703
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3724278 -0.3387499 -0.2829898 0.03849464 -0.3098577 -0.07912173
## 2 0.7707649 -0.4872402 1.1304030 0.05576262 1.1890969 -0.51987167
## 3 -0.4082840 1.5687708 -1.1015857 -0.08027540 -1.0112004 0.88743088
## age dis rad tax ptratio black
## 1 0.02723655 0.01214864 -0.5786988 -0.6052722 -0.1000423 0.2679608
## 2 0.82255145 -0.85370946 1.1680242 1.2695565 0.5468315 -0.5919408
## 3 -1.17655926 1.26178915 -0.6037174 -0.6590027 -0.7421679 0.3540631
## lstat medv
## 1 -0.1783253 0.06127032
## 2 0.8633549 -0.69424886
## 3 -0.9375719 0.95206252
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.05143915 0.128168802
## zn -0.05754639 1.091843941
## indus 0.49017626 0.071189895
## chas 0.04698341 -0.068524065
## nox 1.06863590 0.719733618
## rm -0.15043419 0.380911844
## age -0.12108442 -0.593230736
## dis -0.12373888 0.564951427
## rad 0.69328145 0.007068937
## tax 0.92825702 0.816155360
## ptratio 0.24143922 0.085061886
## black 0.01470321 -0.013028639
## lstat 0.17568563 0.484228519
## medv -0.06969142 0.626189685
##
## Proportion of trace:
## LD1 LD2
## 0.8378 0.1622
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$N_clusters)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
The plot of the LDA results shows that tax and rad have the largest effect on LD1 and age on LD2, which makes them the most influential linear separators between the clusters. (The printed coefficients above also give nox a large weight on LD1 and zn on LD2.)
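One way to check this reading against the numbers instead of eyeballing the arrows is to rank the discriminant coefficients by absolute value. A sketch using the `lda.fit` object from above:

```r
# The scaling matrix holds the LD coefficients; sorting their absolute
# values shows which variables dominate each discriminant.
coefs <- lda.fit$scaling
head(sort(abs(coefs[, "LD1"]), decreasing = TRUE), 3)
head(sort(abs(coefs[, "LD2"]), decreasing = TRUE), 3)
```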
# Add crime factor to data set scaled in last code chunk
# create a quantile vector of crim and print it
bins <- quantile(standardised_boston[['crim']])
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(standardised_boston[['crim']], breaks = bins, include.lowest = TRUE)
# remove original crim from the dataset
standardised_boston <- dplyr::select(standardised_boston, -crim)
# add the new categorical value to scaled data
standardised_boston <- data.frame(standardised_boston, crime)
# Set crime to character for a while
standardised_boston$crime<-as.character(standardised_boston$crime)
# Set 'crime' to factor variable, and give relevant levels
standardised_boston$crime[standardised_boston$crime=="[-0.419,-0.411]"] <- "low"
standardised_boston$crime[standardised_boston$crime=="(-0.411,-0.39]"] <- "med_low"
standardised_boston$crime[standardised_boston$crime=="(-0.39,0.00739]"] <- "med_high"
standardised_boston$crime[standardised_boston$crime=="(0.00739,9.92]"] <- "high"
# back to factor with correct levels
standardised_boston$crime<- factor(standardised_boston$crime, levels = c("low", "med_low", "med_high", "high"))
# Check head of edited table
head(standardised_boston)
## zn indus chas nox rm age dis
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
## rad tax ptratio black lstat medv N_clusters
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278 1
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239 1
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375 1
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886 3
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323 3
## 6 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582 1
## crime
## 1 low
## 2 low
## 3 low
## 4 low
## 5 low
## 6 low
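The character round-trip above (factor to character, rename, back to factor) can be avoided by giving the labels directly to `cut()`. A sketch of the shorter equivalent, assuming the standardised `crim` column and the `bins` quantile vector from before `crim` was removed:

```r
# Equivalent, shorter version: label the quantile bins directly in cut(),
# so no string matching on the interval labels is needed.
crime <- cut(standardised_boston$crim, breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)  # roughly a quarter of the observations per class
```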
# Build up test and train datasets
# choose randomly 80% of the rows
n <- nrow(standardised_boston)
ind <- sample(n, size = n * 0.8)
# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404 15
# create test set
test <- standardised_boston[-ind,]
dim(test)
## [1] 102 15
# Save n_cluster column
n_clusterit<- train$N_clusters
# remove original n_cluster from the dataset
train <- dplyr::select(train, -N_clusters)
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2500000 0.2574257 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 1.08456771 -0.9369443 -0.11640431 -0.8924742 0.5212608 -0.8766213
## med_low -0.09115733 -0.3312543 0.03952046 -0.5917217 -0.1115500 -0.3625932
## med_high -0.39571399 0.1510421 0.14409500 0.3846657 0.1429959 0.4294170
## high -0.48724019 1.0149946 -0.03128211 1.0406720 -0.4855777 0.7957547
## dis rad tax ptratio black lstat
## low 0.8529692 -0.6987343 -0.7602375 -0.49441664 0.3799540 -0.79526046
## med_low 0.3642410 -0.5554602 -0.5409951 -0.04120056 0.3201782 -0.17644351
## med_high -0.3749755 -0.3822389 -0.2825025 -0.26326606 0.1074278 -0.05752024
## high -0.8584374 1.6596029 1.5294129 0.80577843 -0.7464259 0.89613928
## medv
## low 0.5929402
## med_low 0.0102126
## med_high 0.2096063
## high -0.6231054
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11030024 0.90477115 -0.9432520
## indus -0.02369693 -0.21284527 0.3534688
## chas -0.06681638 -0.04469796 0.1555483
## nox 0.32271479 -0.76681432 -1.3321508
## rm -0.11928537 -0.08149308 -0.1426609
## age 0.35340567 -0.29789983 -0.2669457
## dis -0.08563196 -0.47438844 0.2270172
## rad 3.09835255 1.17955576 0.1675980
## tax 0.10830898 -0.36338770 0.2630112
## ptratio 0.15263545 0.06163493 -0.2299144
## black -0.17731254 0.01064261 0.1035738
## lstat 0.14115142 -0.15376102 0.3960080
## medv 0.18691351 -0.36944033 -0.1857561
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9436 0.0416 0.0148
# Collect crime variables
target<- factor(train$crime)
# Collect predictor data
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Next, access the plotly package. Create a 3D plot of the columns of the matrix.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=target)
# Let's colour-code by the k-means clusters instead
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=n_clusterit)
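To see how well the crime model generalises, we could also predict the classes on the test set and cross-tabulate them against the observed ones. A sketch assuming the `lda.fit` and `test` objects from above (this second `test` split still contains its `crime` and `N_clusters` columns):

```r
# Save the correct classes, drop the non-predictor columns, predict with
# the fitted LDA model and compare observed vs predicted in a cross table.
correct_crime <- test$crime
test_predictors <- dplyr::select(test, -crime, -N_clusters)
lda.pred <- predict(lda.fit, newdata = test_predictors)
table(correct = correct_crime, predicted = lda.pred$class)
```

The diagonal of the cross table counts the correctly classified test observations, so it gives a quick accuracy check.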
(more chapters to be added similarly as we proceed with the course!)